59 research outputs found

    Detection and analysis of attention errors in sequence-to-sequence text-to-speech

    Puffin: pitch-synchronous neural waveform generation for fullband speech on modest devices

    We present a neural vocoder designed with low-powered Alternative and Augmentative Communication devices in mind. By combining elements of successful modern vocoders with established ideas from an older generation of technology, our system is able to produce high quality synthetic speech at 48 kHz on devices where neural vocoders are otherwise prohibitively complex. The system is trained adversarially using differentiable pitch synchronous overlap add, and reduces complexity by relying on pitch synchronous Inverse Short-Time Fourier Transform (ISTFT) to generate speech samples. Our system achieves comparable quality with a strong (HiFi-GAN) baseline while using only a fraction of the compute. We present results of a perceptual evaluation as well as an analysis of system complexity. Comment: ICASSP 2023 submission

    Evaluating speech intelligibility enhancement for HMM-based synthetic speech in noise

    It is possible to increase the intelligibility of speech in noise by enhancing the clean speech signal. In this paper we demonstrate the effects of modifying the spectral envelope of synthetic speech according to the environmental noise. To achieve this, we modify Mel cepstral coefficients according to an intelligibility measure that accounts for glimpses of speech in noise: the Glimpse Proportion measure. We evaluate this method against a baseline synthetic voice trained only with normal speech and a topline voice trained with Lombard speech, as well as natural speech. The intelligibility of these voices was measured when mixed with speech-shaped noise and with a competing speaker at three different levels. The Lombard voices, both natural and synthetic, were more intelligible than the normal voices in all conditions. For speech-shaped noise, the proposed modified voice was as intelligible as the Lombard synthetic voice without requiring any recordings of Lombard speech, which are hard to obtain. However, in the case of competing talker noise, the Lombard synthetic voice was more intelligible than the proposed modified voice. Index Terms: HMM-based speech synthesis, intelligibility of speech in noise, Lombard speech

    Speech Waveform Reconstruction using Convolutional Neural Networks with Noise and Periodic Inputs

    Speech Enhancement of Noisy and Reverberant Speech for Text-to-Speech

    Evaluating Cognitive Load of Text-To-Speech (TTS) synthesis

    Current evaluation methods for text-to-speech (TTS) synthesis rely solely on subjective rating scores. These tests typically account mostly for how natural or intelligible the voice is. With state-of-the-art systems, these measures are approaching ceiling and therefore alternative measures such as the cognitive load may become more meaningful. To our knowledge, there is little or no recent work evaluating the cognitive load of state-of-the-art text-to-speech systems. We use pupillometry as a measure of cognitive load. The pupil has been found to dilate upon increased cognitive effort when carrying out a listening task. Currently we are evaluating speech generated by a Deep Neural Network TTS synthesiser. In our method, we generate stimuli that step incrementally from natural speech to synthesized speech by changing only a single feature at a time. Stimuli are presented to listeners in speech-shaped noise conditions.

    Differentiable Grey-box Modelling of Phaser Effects using Frame-based Spectral Processing

    Machine learning approaches to modelling analog audio effects have seen intensive investigation in recent years, particularly in the context of non-linear time-invariant effects such as guitar amplifiers. For modulation effects such as phasers, however, new challenges emerge due to the presence of the low-frequency oscillator which controls the slowly time-varying nature of the effect. Existing approaches have either required foreknowledge of this control signal, or have been non-causal in implementation. This work presents a differentiable digital signal processing approach to modelling phaser effects in which the underlying control signal and time-varying spectral response of the effect are jointly learned. The proposed model processes audio in short frames to implement a time-varying filter in the frequency domain, with a transfer function based on typical analog phaser circuit topology. We show that the model can be trained to emulate an analog reference device, while retaining interpretable and adjustable parameters. The frame duration is an important hyper-parameter of the proposed model, so an investigation was carried out into its effect on model accuracy. The optimal frame length depends on both the rate and transient decay-time of the target effect, but the frame length can be altered at inference time without a significant change in accuracy. Comment: Accepted for publication in Proc. DAFx23, Copenhagen, Denmark, September 2023